Designing advanced digital systems requires a thorough understanding of clock management circuits. The synchronous design methodology is built on the premise of a reliable clock distribution scheme. High-performance applications require new clock management approaches to keep synchronous circuits functioning properly.
Synchronous design is popular because it simplifies timing relationships, allowing the designer to focus on circuit functionality. The only performance concerns with synchronous designs are maximum clock frequency and input and output timing relationships (see Figure 1, p. 26).
Of course, every ideal solution has a few practical problems. One of the first problems one encounters is clock skew. Clock skew is the result of small variations in the time at which clock signals arrive at their destinations, usually register clock pins. If these variations become large, data may not be transferred reliably between registers. Shift registers are the most sensitive because of the very short data path between register bits. In this situation, clock skew may cause a hold-time violation, and we say that data slip occurs. Clock skew also limits maximum clock frequency because it subtracts from the time available in each clock period.
Clock skew occurs because the clock must be distributed throughout the system using board traces, connectors, backplanes, chip-level input-clock driver pads, and on-chip interconnect. Each of these elements adds a small amount of phase delay. The concern is that the differential delay, or skew, between a launching register and a capturing register must be less than the minimum data-path delay minus the register hold time.
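The hold-time condition above can be written as a one-line check. This is a minimal sketch, assuming skew is measured as capture-clock arrival minus launch-clock arrival, that the minimum data-path delay already includes the launching register's clock-to-out time, and that all values are in picoseconds. The function name and the numbers are hypothetical.

```python
def hold_slack_ps(skew_ps: float, min_data_path_ps: float, hold_ps: float) -> float:
    """Hold-time slack in picoseconds; a negative value means data slip.

    skew_ps          -- capture-clock arrival minus launch-clock arrival
    min_data_path_ps -- shortest clock-to-out plus wiring/logic delay
    hold_ps          -- hold time of the capturing register
    """
    # The skew must stay below (minimum data-path delay - hold time).
    return (min_data_path_ps - hold_ps) - skew_ps


# Hypothetical shift-register stage: almost no logic between bits.
for skew in (50.0, 150.0):
    slack = hold_slack_ps(skew, min_data_path_ps=200.0, hold_ps=80.0)
    status = "OK" if slack >= 0 else "hold violation (data slip)"
    print(f"skew {skew:.0f} ps: slack {slack:.0f} ps, {status}")
```

With a 200-ps data path and an 80-ps hold time, 50 ps of skew leaves 70 ps of slack. Raising the skew to 150 ps produces a 30-ps hold violation.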
Management of chip-level clock skew is typically handled by careful layout of the integrated circuit (IC) device. Several techniques are available, including buffered-tree, clock-grid, and clock-tree synthesis approaches. The buffered tree is routinely used on FPGA devices because the register clock multiplexer input pins create a fixed, regular load. Clock-tree synthesis is popular in ASIC designs because it optimizes performance, power, and area.
In fact, managing board-level skew can be even more difficult than managing chip-level clock skew. As faster technologies have become available, it has become necessary to use phase-locked loop (PLL) and delay-locked loop (DLL) circuits to minimize skew. Elements that exhibit significant delay are placed inside a feedback loop, so that their effective delay can be minimized or nulled.
With proper design, the overall system clock skew can be minimized.
It's important to note that the hold-time limit on clock skew has nothing to do with maximum clock frequency: slowing the clock down won't cure a data-slip problem. Whether skew causes a hold violation depends only on the phase delays of the clock distribution network relative to the minimum data-path delay and the register hold time. In older technologies, such as 7400-series TTL logic, data-path delays were on the order of tens of nanoseconds, so an equivalent amount of clock skew could be tolerated. In advanced deep-submicron technologies, data-path delays are in the tens of picoseconds, which demands much more care in the design process.
Clock-to-output limitations
Pushing the maximum clock frequency is another challenge. Usually, the first limitation is the time it takes to transmit a signal from one chip to the next. Data is clocked into an output register, propagates through the output buffer and board traces, and arrives at the input of a register on another chip. The on-chip clock distribution network adds phase delay, the so-called clock insertion delay, to the clock-to-out time. What we would like to do is minimize this insertion delay, so that the clock-to-out time is reduced to just the register-to-output delay. By minimizing this delay, we can typically increase the system clock frequency. Here we have another application for PLL and DLL circuits (see Figures 2a and 2b).
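The effect of insertion delay on the chip-to-chip budget can be estimated with simple arithmetic. This is a rough sketch, assuming the board clock reaches both chips with zero skew and that the receiving chip's setup time is referred to its pins. A PLL or DLL that nulls the insertion delay is modeled here as `insertion_ps=0`, although a real loop leaves some residual phase offset. All numbers are hypothetical.

```python
def max_clock_mhz(insertion_ps: float, reg_to_out_ps: float,
                  board_ps: float, setup_ps: float) -> float:
    """Rough upper bound on the system clock for one chip-to-chip transfer.

    Assumes the board clock reaches both chips with zero skew and that
    setup_ps is the receiving chip's setup time referred to its pins.
    """
    clock_to_out_ps = insertion_ps + reg_to_out_ps  # on-chip clock delay adds here
    period_ps = clock_to_out_ps + board_ps + setup_ps
    return 1e6 / period_ps  # period in ps -> frequency in MHz


# Hypothetical numbers: 2 ns of clock insertion delay versus a PLL/DLL that nulls it.
without_loop = max_clock_mhz(2000, 3000, 1500, 1000)   # 7.5-ns period
with_loop = max_clock_mhz(0, 3000, 1500, 1000)         # 5.5-ns period
print(f"{without_loop:.1f} MHz -> {with_loop:.1f} MHz")
```

With these numbers, removing 2 ns of insertion delay raises the bound from about 133.3 MHz to about 181.8 MHz.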
But what if we need to operate at even higher clock frequencies? At some point, board-trace delays and IC input and output buffer delays become large enough that it's impractical to increase the clock frequency any further.
One idea is to design systems that perform operations internally at very high speeds but communicate with the outside world at lower speeds. It's usually possible to increase the internal clock frequency of an IC by including a clock frequency multiplier on-chip. This technique has been used for years to make microprocessors run at very high clock frequencies relative to the motherboard clock frequency.
Of course, it would be nice if the generated clock had a fixed relationship to the slower board-level system clock. Once again, PLL and DLL circuits come to the rescue. This time, instead of nulling delay in a feedback loop, the circuits are used with, for example, a divide-by-two clock divider in the feedback loop. This tricks the circuit into generating a clock that runs at twice the reference frequency (see Figures 3 and 4).
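A toy discrete-time model shows why a divider in the feedback path multiplies the output frequency. The loop can only settle when the divided-down output matches the reference in phase, and that forces the oscillator to run at M times the reference. This is a sketch under stated assumptions, not a model of any real device. The phase-domain update runs once per reference period. The proportional-integral loop filter and the gains `kp` and `ki` are hypothetical values chosen only so that the loop is stable.

```python
def settle_pll(f_ref_mhz: float, m: int, f_vco_start_mhz: float,
               steps: int = 200, kp: float = 0.5, ki: float = 0.1) -> float:
    """Return the VCO frequency after `steps` reference cycles (toy model).

    Phases are in cycles and are updated once per reference period. The
    feedback clock is the VCO output divided by m. kp and ki are
    hypothetical normalized loop gains chosen only to give a stable loop.
    """
    ref_phase = fb_phase = 0.0
    integrator = f_vco_start_mhz / f_ref_mhz  # VCO control word, in units of f_ref
    f_vco_mhz = f_vco_start_mhz
    for _ in range(steps):
        error = ref_phase - fb_phase                  # phase detector (cycles)
        control = integrator + kp * m * error         # proportional-integral filter
        integrator += ki * m * error
        f_vco_mhz = control * f_ref_mhz               # VCO output
        ref_phase += 1.0                              # one reference cycle elapses
        fb_phase += f_vco_mhz / (m * f_ref_mhz)       # divide-by-m feedback clock
    return f_vco_mhz


print(f"{settle_pll(50.0, 2, 130.0):.3f} MHz")   # divide-by-2 feedback: 2 x 50 MHz
```

The free-running start frequency (130 MHz here) does not matter. Because the integrator keeps adjusting until the phase error is zero, the output settles at 100 MHz. Changing `m` to 4 would yield a multiply-by-four loop instead.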
At some point, the synchronous design style simply can't be made to go any faster. This is due to the interlocking nature of the clock skew, clock-to-out time, and setup time issues. However, we can adopt architectural features such that data is processed in wider words and with more pipeline stages. But if we want to transfer large amounts of data as fast as the IC can process it, then we need lots of wires and lots of board space. The current trend in I/O design is to use high-speed serial transmission techniques, which use only a few wires and minimal board space.
Source synchronous clocking
A technique called source synchronous clocking can be used to move data at very high speeds, while avoiding the classical clock skew, clock-to-out time, and set-up time problems. Source synchronous clocking sends the clock in parallel with the data. The receiving end then uses the clock to recover the data. The recovered data is synchronized to the receiver's local clock with a FIFO data buffer.
The source synchronous approach allows data to be transferred at the maximum possible data rate, but with some latency. If both the rising and falling edges of the clock are used, two bits of data can be sent each clock cycle. This is commonly called using both edges of the clock, or double data rate (DDR). This technique is often used with LVDS differential I/O transceivers to create high-speed serial data links running at up to 1 Gbit/s.
For example, in an OC-12 application, a processor can emit bytes at a 78-MHz rate and transfer them at 622 Mbps over two LVDS channels. The 78-MHz clock is multiplied by four with a PLL or DLL to produce 311 MHz. This 311-MHz clock is transmitted over one LVDS channel. The data is serialized and transferred over the second channel at two bits each clock cycle (one on each edge), giving 622 Mbps (see Figure 5).
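The clock arithmetic and the byte-to-bit-pair split described above can be checked with a short sketch. It uses the nominal SONET OC-12 line rate of 622.08 Mbps, which the text rounds to 622 Mbps, 78 MHz, and 311 MHz. The MSB-first bit order and the function name `serialize_ddr` are assumptions for illustration only.

```python
from __future__ import annotations

OC12_LINE_RATE_MBPS = 622.08          # nominal SONET OC-12 line rate

byte_clock_mhz = OC12_LINE_RATE_MBPS / 8   # 77.76 MHz (the "78-MHz" byte clock)
link_clock_mhz = byte_clock_mhz * 4        # 311.04 MHz after a x4 PLL or DLL
data_rate_mbps = link_clock_mhz * 2        # two bits per cycle, one per clock edge


def serialize_ddr(byte: int) -> list[tuple[int, int]]:
    """Split one byte into four (rising-edge bit, falling-edge bit) pairs.

    Bit order (MSB first) is an assumption for illustration.
    """
    bits = [(byte >> (7 - i)) & 1 for i in range(8)]
    return [(bits[2 * c], bits[2 * c + 1]) for c in range(4)]


print(f"{byte_clock_mhz:.2f} MHz x4 = {link_clock_mhz:.2f} MHz, "
      f"{data_rate_mbps:.2f} Mbps")
print(serialize_ddr(0xA5))   # 0xA5 = 1010 0101
```

One byte occupies four cycles of the 311-MHz link clock. This is why multiplying the byte clock by four and using both edges exactly matches the 622-Mbps data rate.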
The receiver uses the transmitted clock to de-serialize the incoming data, recovering two bits of data each clock cycle. Another PLL or DLL divides the 311-MHz clock down to 155 MHz and 78 MHz. These clocks move the received data, as bytes, into a FIFO, where it can be read at 78 MHz with the local system clock. The FIFO synchronizes the receive-data stream to the local system clock.
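The receive side can be sketched the same way: reassemble bytes from bit pairs, then pass them through a FIFO that separates the recovered-clock writer from the local-clock reader. This is a behavioral sketch only. The MSB-first order, the FIFO depth, and the names `deserialize_ddr` and `RxFifo` are hypothetical. A hardware dual-clock FIFO must also synchronize its read and write pointers across clock domains, which this model does not attempt.

```python
from __future__ import annotations

from collections import deque


def deserialize_ddr(pairs: list[tuple[int, int]]) -> list[int]:
    """Rebuild bytes from (rising-edge, falling-edge) bit pairs, MSB first.

    Four clock cycles (eight bits) make one byte; a trailing partial byte is dropped.
    """
    out = []
    usable = len(pairs) - len(pairs) % 4
    for start in range(0, usable, 4):
        value = 0
        for rise_bit, fall_bit in pairs[start:start + 4]:
            value = (value << 2) | (rise_bit << 1) | fall_bit
        out.append(value)
    return out


class RxFifo:
    """Toy receive FIFO: written in the recovered-clock domain, read with the
    local system clock. Only ordering and overflow are modeled."""

    def __init__(self, depth: int = 8) -> None:
        self.depth = depth
        self.entries = deque()

    def write(self, byte: int) -> None:
        if len(self.entries) >= self.depth:
            raise OverflowError("receive FIFO overflow")
        self.entries.append(byte)

    def read(self) -> int | None:
        return self.entries.popleft() if self.entries else None


received = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 0), (1, 1), (1, 1), (0, 0)]
fifo = RxFifo()
for byte in deserialize_ddr(received):   # recovered-clock side
    fifo.write(byte)
print([hex(fifo.read()), hex(fifo.read())])   # local 78-MHz side
```

The FIFO is what absorbs the latency mentioned earlier. Bytes come out in order, but only after they have crossed from the transmitted-clock domain into the local one.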
The highest-speed serial data transmission schemes encode the data and clock together. For example, Gigabit Ethernet (1000BASE-X) uses an 8-bit/10-bit encoding scheme, sending each 8 bits of data as a 10-bit code group over a single channel. The 8-bit/10-bit scheme maintains a balance of 1's and 0's in the data stream and provides enough transitions for a receiver PLL to lock onto the clock frequency.
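The two properties credited to 8-bit/10-bit coding, DC balance and frequent edges, can be measured directly. This sketch is not an encoder. It only computes disparity and the longest edge-free run. It compares the two running-disparity forms of the K28.5 code group, taken from the standard 8b/10b table, against four unencoded zero bytes. The helper names are hypothetical.

```python
def disparity(bits: str) -> int:
    """Number of ones minus number of zeros."""
    return bits.count("1") - bits.count("0")


def max_run_length(bits: str) -> int:
    """Longest run of identical bits, i.e. the longest stretch with no edge."""
    if not bits:
        return 0
    longest = run = 1
    for prev, cur in zip(bits, bits[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


# K28.5 code group in its two running-disparity forms (from the 8b/10b table).
k28_5_neg = "0011111010"   # sent when running disparity is negative
k28_5_pos = "1100000101"   # sent when running disparity is positive
coded = k28_5_neg + k28_5_pos
raw = "00000000" * 4       # four unencoded 0x00 bytes

print(disparity(coded), max_run_length(coded))   # balanced, runs of at most 5
print(disparity(raw), max_run_length(raw))       # no edges at all for 32 bits
```

The coded stream is balanced and never goes more than five bit times without a transition. The raw zeros give a receiver PLL nothing to lock onto for 32 bit times.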
PLL and DLL circuits
Thus far, we have talked about a wide range of applications for phase-locked loop and delay-locked loop circuits. In many cases, either circuit can be used.
However, there are some cases where one is better than the other. In fact, some applications may strictly require the use of one circuit rather than the other.
The phase-locked loop has been in use for many years. A voltage-controlled oscillator (VCO) is adjusted in a negative feedback loop. This loop causes the frequency and phase of the reference and feedback inputs to match. The feedback loop may contain delay elements, such as slow clock buffers. The negative feedback action ensures that the output of the clock buffer is matched in frequency and phase to the reference clock.
A divide-by-M function placed in the feedback loop creates a multiply-by-M clock multiplier circuit. A divide-by-N function placed in the reference input creates a clock divider circuit. A combination of these dividers permits synthesis of a frequency, which is M/N times the reference frequency (see Figure 6).
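Choosing the two dividers for a given target frequency is a small search over f_out = f_ref × M / N. This is a brute-force sketch. The divider ranges and the function name are hypothetical. Real PLLs also restrict the VCO and phase-detector frequency ranges, and those limits are deliberately left out here.

```python
from __future__ import annotations


def choose_m_n(f_ref_mhz: float, f_target_mhz: float,
               m_max: int = 64, n_max: int = 64) -> tuple[int, int, float]:
    """Brute-force the (M, N) pair whose f_ref * M / N is closest to the target.

    Ties keep the pair with the smallest N found first. Real PLLs also limit the
    VCO and phase-detector frequencies; those constraints are omitted here.
    """
    best = None
    for n in range(1, n_max + 1):          # divide-by-N on the reference input
        for m in range(1, m_max + 1):      # divide-by-M in the feedback loop
            f_out = f_ref_mhz * m / n
            error = abs(f_out - f_target_mhz)
            if best is None or error < best[0]:
                best = (error, m, n, f_out)
    _, m, n, f_out = best
    return m, n, f_out


print(choose_m_n(20.0, 50.0))   # M/N = 5/2
```

For a 20-MHz reference and a 50-MHz target, the search returns M = 5 and N = 2.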
The delay-locked loop was relatively obscure until Xilinx popularized it in the Virtex FPGA family.
The incoming clock signal is delayed so that, after further processing by the circuits in the feedback loop, the output is delayed by exactly one or more whole clock cycles. The DLL employs an adjustable delay line element. If matched elements are cascaded internally, it's possible to create clock doubler and clock divider functions (see Figures 7a and 7b).
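A digital DLL's lock search can be sketched as adding delay-line taps until the fixed loop delay plus the tap delay equals a whole number of clock periods. This is a simplified model, not the Xilinx design. It assumes a uniform tap size and an ideal phase detector, and it searches upward from zero taps. The function name and the numbers are hypothetical. The residual it returns is the step-size error discussed below.

```python
from __future__ import annotations

import math


def dll_lock(period_ps: float, loop_delay_ps: float, step_ps: float,
             max_taps: int = 512) -> tuple[int, float]:
    """Add delay-line taps until the loop delay totals a whole number of clock periods.

    loop_delay_ps is the fixed delay inside the feedback loop (e.g. a clock buffer).
    Returns (taps, residual_error_ps); the residual is the step-size error,
    always within half a tap of zero.
    """
    cycles = max(1, math.ceil(loop_delay_ps / period_ps))  # one or more whole cycles
    target_ps = cycles * period_ps
    for taps in range(max_taps + 1):
        error_ps = target_ps - (loop_delay_ps + taps * step_ps)  # >0: feedback still early
        if error_ps < step_ps / 2:
            return taps, error_ps
    raise RuntimeError("delay line too short to reach lock")


# Hypothetical 100-MHz clock, 3.13-ns clock-buffer delay, 40-ps delay taps.
print(dll_lock(10_000.0, 3_130.0, 40.0))
```

Here the loop locks at 172 taps with a 10-ps residual. Once that setting is found, a digital DLL could hold it constant, which gives the fixed-delay behavior described later in this article.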
PLLs and DLLs may be designed exclusively with analog or digital techniques, or with a mix of the two. Most PLLs are hybrid analog and digital. The Xilinx DLL is said to be all digital. Both PLL and DLL circuits must first achieve lock, the state in which the input reference and feedback signals are aligned. After lock is achieved, both circuits have several sources of error, or deviation of the buffered clock delay from the ideal reference clock. Phase offset is constant in nature and comes from delays on the board, input and output buffer delays, and the PLL and DLL circuits themselves.
Jitter is random variation of delay caused by noise injected through the power supplies and by any noise induced in or generated by analog circuits. Step-size errors occur in digital DLLs and PLLs and reflect the minimum resolution of the delay or VCO control. These errors create jitter-like effects as the control circuitry dithers between adjacent steps. For clock de-skew applications, the total jitter and error from all sources must be added to the hold-time specification, effectively reducing the hold-time margin.
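Folding the clock-error terms into the hold requirement can be sketched as below. This sketch assumes every term is a worst-case magnitude in picoseconds and that the terms are summed linearly, which is conservative. Statistical budgeting of random jitter is a common alternative that this sketch does not attempt. The function name and the numbers are hypothetical.

```python
def hold_margin_ps(min_data_path_ps: float, hold_ps: float, skew_ps: float,
                   phase_offset_ps: float, jitter_ps: float,
                   step_error_ps: float) -> float:
    """Hold margin once clock-error terms are added to the hold requirement.

    Every term is treated as a worst-case magnitude and summed linearly,
    which is conservative.
    """
    effective_hold_ps = hold_ps + abs(phase_offset_ps) + jitter_ps + step_error_ps
    return min_data_path_ps - skew_ps - effective_hold_ps


# Hypothetical numbers: the same path with and without clock-error terms.
print(hold_margin_ps(400, 100, 50, 0, 0, 0))     # ideal clock
print(hold_margin_ps(400, 100, 50, 60, 80, 20))  # offset, jitter, step error
```

In this example, 160 ps of combined phase offset, jitter, and step error cuts the hold margin from 250 ps to 90 ps.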
When minimizing clock insertion delay (clock-to-out time), jitter reduces the maximum clock frequency. However, if there is some margin available, the digital DLL has an interesting advantage over the digital PLL: its control circuitry can be turned off after lock is achieved. At this point, the DLL becomes a fixed delay element. It may exhibit some thermally induced drift, but there are no jitter-like effects from step-size error.
A properly designed digital DLL can also allow the clock to be stopped for power-down applications. The DLL simply retains its current delay setting, and when the next clock edge arrives, it's delayed by the same amount as before the clock stopped. If there was an offset error before the clock stopped, the same error exists upon restart. Unfortunately, PLLs accumulate frequency error, and digital PLLs can't recover from a stopped clock quite so gracefully. Basically, the VCO loses its phase alignment with the reference clock. Analog PLLs must restart the entire lock capture process.
For clock frequency synthesis applications such as boosting the internal clock frequency of a chip, a PLL is typically used. In this application, the hybrid analog/digital PLLs work well because they do not have step error.
Jitter due to noise and metastability of the phase detector reduces slack time in register-to-register transfers. This type of jitter is the most significant error in high-speed I/O applications because it affects the de-serialization process, showing up in the bit-error rate. The choice between phase-locked loop and delay-locked loop circuits should be driven by the application requirements: which unique features are required, and which types of errors can be tolerated.
Bob Kirk is advanced engineering manager of Conversion ASICs for AMI Semiconductor. He has been with AMIS since 1973 and started the Twain Harte, CA Design Center in 1982. Kirk led the development of AMIS' NETRANS FPGA-to-ASIC and ASIC-to-ASIC conversion program.